Team Members:

  • Alua Onayeva
  • Aslan Askarbek
  • Rakhat Zhangabay


In [19]:
from jupyterquiz import display_quiz
import numpy as np

At this stage, having acquired the necessary data and a preliminary idea about the models you intend to train and implement, you might be contemplating what steps to take next. How can you ensure that your model's performance isn't merely a consequence of chance in the data selection process? How do you determine if the selected model outperforms others? Additionally, what measures should be considered if the available data is limited, potentially leading to overfitting in the models?

The answer to all these questions is simple: use cross-validation.

Cross-validation

Cross-validation is a method for assessing the effectiveness of a machine learning model in which the data is divided into several subsets. The model is trained on one part of the data and tested on another, and the process is repeated several times to ensure that the model generalizes well to different parts of the data, not just to a specific subset. This gives a more objective assessment of the model's performance.

Of course, the choice of technique should account for small vs. large datasets, balanced vs. imbalanced classes, and time-series vs. non-time-series data.

The main purpose

  • Better Performance Evaluation: It gives a more precise estimate of the model's ability to generalize to unseen data than a single train-test split.

  • Hyperparameter Tuning: Cross-validation shows how the model performs across several folds for each hyperparameter setting, helping to identify the settings with the best performance.

  • Overfitting Avoidance: Hyperparameter tuning without cross-validation might overfit to a specific train-test split. Cross-validation mitigates this risk by evaluating hyperparameters across various data subsets, ensuring better generalization.

W3sicXVlc3Rpb24iOiAiV2hhdCBpcyB0aGUgcHVycG9zZSBvZiBjcm9zcy12YWxpZGF0aW9uIGluIG1hY2hpbmUgbGVhcm5pbmc/IiwgInR5cGUiOiAibWFueV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICJUbyB0cmFpbiBtb2RlbHMgZmFzdGVyIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkNyb3NzLXZhbGlkYXRpb24gaXNuJ3QgcHJpbWFyaWx5IHVzZWQgdG8gdHJhaW4gbW9kZWxzIGZhc3Rlci4gSXRzIHByaW1hcnkgcHVycG9zZSBsaWVzIGluIGFzc2Vzc2luZyB0aGUgbW9kZWwncyBwZXJmb3JtYW5jZSByYXRoZXIgdGhhbiBleHBlZGl0aW5nIHRoZSB0cmFpbmluZyBwcm9jZXNzLiJ9LCB7ImFuc3dlciI6ICJUbyBhc3Nlc3MgYSBtb2RlbCdzIHBlcmZvcm1hbmNlIiwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ3Jvc3MgVmFsaWRhdGlvbiBhaWRzIGluIGVzdGltYXRpbmcgdGhlIG1vZGVsJ3MgcGVyZm9ybWFuY2Ugb24gdW5zZWVuIGRhdGEgYW5kIHByZXZlbnRzIG92ZXJmaXR0aW5nIG9yIHVuZGVyZml0dGluZy4ifSwgeyJhbnN3ZXIiOiAiVG8gcmVkdWNlIHRoZSBudW1iZXIgb2YgZmVhdHVyZXMiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiQ3Jvc3MtdmFsaWRhdGlvbiBkb2Vzblx1MjAxOXQgZGlyZWN0bHkgZm9jdXMgb24gcmVkdWNpbmcgdGhlIG51bWJlciBvZiBmZWF0dXJlcy4gRmVhdHVyZSBzZWxlY3Rpb24gb3IgZGltZW5zaW9uYWxpdHkgcmVkdWN0aW9uIHRlY2huaXF1ZXMgYXJlIHNwZWNpZmljYWxseSBlbXBsb3llZCBmb3IgdGhpcyBwdXJwb3NlLiAifSwgeyJhbnN3ZXIiOiAiVG8gaW5jcmVhc2UgbW9kZWwgY29tcGxleGl0eSIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJDcm9zcy12YWxpZGF0aW9uIGRvZXNuJ3QgYWltIHRvIGluY3JlYXNlIG1vZGVsIGNvbXBsZXhpdHkuIEluIGZhY3QsIGl0IGhlbHBzIGluIHByZXZlbnRpbmcgb3Zlcmx5IGNvbXBsZXggbW9kZWxzIGJ5IGFzc2Vzc2luZyB0aGVpciBwZXJmb3JtYW5jZSBvbiBkaWZmZXJlbnQgc3Vic2V0cyBvZiBkYXRhLiJ9XX1d

In [2]:
display_quiz("#qqq1")

Train and Test splits

Idea: randomly divide the data into training and test sets, the same for all models. The quality of the models and their resistance to overfitting are checked on the test data. This is a common choice and a quick validation method.

Commonly used splits: 80% training / 20% test, or 70% training / 30% test.

However, this approach has a major weakness: the result depends directly on which samples end up in the training set and which in the test set. The following approaches solve this problem.
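To see this weakness concretely, here is a minimal sketch (using a synthetic `make_classification` dataset, not the heart data used later): the same model's test accuracy shifts as the random split changes.

```python
# Hypothetical illustration: the same model scored on five different random
# train/test splits of a synthetic dataset gives five different accuracies.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=500, random_state=0)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print([round(s, 3) for s in scores])  # accuracy varies from split to split
```

Cross-validation averages over many such splits, so the estimate no longer hinges on one lucky or unlucky partition.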

In [3]:
import warnings
warnings.filterwarnings("ignore")

We will use the Cardiovascular Heart Disease dataset. It includes information on age, gender, height, weight, blood pressure, cholesterol level, glucose level, smoking habits, and alcohol consumption for over 70 thousand individuals.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('heart_data.csv/heart_data.csv')
X = df.drop("cardio", axis=1)
y = df["cardio"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=51)
print(X_train.shape)
print(X_test.shape)
(56000, 13)
(14000, 13)
In [35]:
df.head()
Out[35]:
index id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
0 0 0 18393 2 168 62.0 110 80 1 1 0 0 1 0
1 1 1 20228 1 156 85.0 140 90 3 1 0 0 1 1
2 2 2 18857 1 165 64.0 130 70 3 1 0 0 0 1
3 3 3 17623 2 169 82.0 150 100 1 1 0 0 1 1
4 4 4 17474 1 156 56.0 100 60 1 1 0 0 0 0

As you can see below, the class distribution is approximately even, so this dataset can be considered balanced.

In [5]:
y.value_counts()
Out[5]:
0    35021
1    34979
Name: cardio, dtype: int64

Types of Cross-validation

K-Fold

The main idea of this approach is to split the whole dataset into $K$ parts of equal size; each partition is called a fold.

One fold is used for validation and the other $K-1$ folds are used for training the model. The procedure is repeated $K$ times so that every fold serves as the validation set exactly once, with the remaining folds used for training. As a result, every observation is used in both the training and test groups.

k-fold

Standard values:

Most commonly, 5 or 10 folds are used.

This validation technique is not well suited to imbalanced datasets: a random fold may not preserve the proper ratio of each class's data, so the model may not be trained on a representative sample. This issue can be resolved using Stratified K-Fold, described below.

In [6]:
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)

count = 1
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{count}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    count += 1
Fold:1, Train set: 56000, Test set:14000
Fold:2, Train set: 56000, Test set:14000
Fold:3, Train set: 56000, Test set:14000
Fold:4, Train set: 56000, Test set:14000
Fold:5, Train set: 56000, Test set:14000

Now we will apply this method on different models, evaluate accuracy for every fold and output the mean.

Logistic Regression

In [7]:
from sklearn.linear_model import LogisticRegression

score = cross_val_score(LogisticRegression(random_state= 42), X, y, cv= kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.69964286 0.69935714 0.69678571 0.69014286 0.698     ]
Average score: 0.70

Random Forest

In [8]:
from sklearn.ensemble import RandomForestClassifier

score = cross_val_score(RandomForestClassifier(random_state= 42), X, y, cv= kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.72757143 0.72835714 0.72542857 0.72807143 0.72271429]
Average score: 0.73

Gradient Boosting

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

score = cross_val_score(GradientBoostingClassifier(random_state= 42), X, y, cv= kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.73685714 0.73642857 0.73307143 0.73392857 0.73571429]
Average score: 0.74
Q. Why is it important to use cross-validation with Gradient Boosting? Gradient boosting tends to overfit rapidly: at each iteration it fits the errors of the previous iterations, reducing the training error step by step until the stopping criterion is met. Cross-validation helps choose the best stopping criterion.
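One way to sketch this idea (on a synthetic dataset via `make_classification`, not the heart data, and with a single held-out validation set instead of full k-fold for brevity): monitor held-out accuracy after each boosting stage with `staged_predict` and keep the stage where it peaks.

```python
# Sketch: choose the number of boosting iterations by validation accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

gb = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# staged_predict yields predictions after each of the 200 boosting stages
val_acc = [accuracy_score(y_val, y_pred) for y_pred in gb.staged_predict(X_val)]
best_n = int(np.argmax(val_acc)) + 1
print(f"Best number of iterations: {best_n}")
```

The training error keeps falling with every stage, but the validation curve flattens and then degrades; the peak marks a sensible stopping point.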

W3sicXVlc3Rpb24iOiAiV2hhdCBkb2VzIHRoZSB0ZXJtICdLJyByZXByZXNlbnQgaW4gSy1Gb2xkIENyb3NzLVZhbGlkYXRpb24/IiwgInR5cGUiOiAibWFueV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICJUaGUgbnVtYmVyIG9mIGZlYXR1cmVzIGluIHRoZSBkYXRhc2V0IiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIldyb25nOigifSwgeyJhbnN3ZXIiOiAiVGhlIG51bWJlciBvZiB0aW1lcyB0aGUgbW9kZWwgaXMgdHJhaW5lZCIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJXcm9uZzooIn0sIHsiYW5zd2VyIjogIlRoZSBudW1iZXIgb2Ygc3Vic2V0cyBpbnRvIHdoaWNoIHRoZSBkYXRhIGlzIGRpdmlkZWQiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJSaWdodCEifSwgeyJhbnN3ZXIiOiAiVGhlIG51bWJlciBvZiBlcG9jaHMgZHVyaW5nIG1vZGVsIHRyYWluaW5nIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIldyb25nOigifV19XQ==

In [10]:
display_quiz("#qqq2")

W3sicXVlc3Rpb24iOiAiSW4gYSAxMC1mb2xkIGNyb3NzLXZhbGlkYXRpb24gb24gYSBkYXRhc2V0IHdpdGggMTAwMCBzYW1wbGVzLCBob3cgbWFueSBzYW1wbGVzIGFyZSB1c2VkIGZvciB2YWxpZGF0aW9uIGluIGVhY2ggZm9sZD8iLCAidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJzIjogW3siYW5zd2VyIjogIjEwIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIldyb25nOigifSwgeyJhbnN3ZXIiOiAiOTAiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiV3Jvbmc6KCJ9LCB7ImFuc3dlciI6ICIxMDAiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJSaWdodCEgSW4gMTAtZm9sZCBjcm9zcy12YWxpZGF0aW9uLCB0aGUgZGF0YXNldCBpcyBkaXZpZGVkIGludG8gMTAgZXF1YWwgcGFydHMgb3IgZm9sZHMuIER1cmluZyBlYWNoIGl0ZXJhdGlvbiwgb25lIGZvbGQgaXMgdXNlZCBmb3IgdmFsaWRhdGlvbiwgd2hpbGUgdGhlIHJlbWFpbmluZyBuaW5lIGZvbGRzIGFyZSB1c2VkIGZvciB0cmFpbmluZyB0aGUgbW9kZWwuICJ9LCB7ImFuc3dlciI6ICIxMTAiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiV3Jvbmc6KCJ9XX1d

In [11]:
display_quiz("#qqq3")

K-Fold Model Tuning

Logistic Regression

Trying different solvers for Logistic Regression.

In [12]:
algorithms = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

for algo in algorithms:
    score = cross_val_score(LogisticRegression(max_iter= 500, solver= algo, random_state= 42), X, y, cv= kf, scoring="accuracy")
    print(f'Average score({algo}): {"{:.3f}".format(score.mean())}')
Average score(newton-cg): 0.721
Average score(lbfgs): 0.697
Average score(liblinear): 0.707
Average score(sag): 0.668
Average score(saga): 0.647

Random Forest

Trying different values of the max_leaf_nodes parameter.

In [13]:
max_leaf_nodes = [None, 5, 10, 15, 20]

for val in max_leaf_nodes:
    score = cross_val_score(RandomForestClassifier(max_leaf_nodes= val, random_state= 42), X, y, cv= kf, scoring="accuracy")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')
Average score(None): 0.726
Average score(5): 0.723
Average score(10): 0.726
Average score(15): 0.727
Average score(20): 0.728

Gradient Boosting

We can also search over several parameters at once using GridSearchCV.

In [14]:
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, 7],
}

grid_search = GridSearchCV(GradientBoostingClassifier(random_state=42), params, cv=kf, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(best_params)
{'max_depth': 3, 'n_estimators': 100}
In [15]:
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)
print(predictions)
[1 1 1 ... 1 1 0]

Stratified K-Fold

The approach takes its name from the term stratum: subjects are divided into subgroups, called strata, based on characteristics they share (e.g., race, gender, educational attainment). Once divided, each subgroup is sampled randomly.

This is an enhanced version of the k-fold cross-validation technique. It also splits the dataset into $k$ equal folds, but each fold preserves the same ratio of target classes as the complete dataset, making each fold more representative.

This makes it work well for imbalanced datasets, but not for time-series data.
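A minimal sketch of this property, assuming a deliberately imbalanced synthetic dataset (roughly 90/10) rather than our balanced heart data: every stratified test fold carries essentially the same share of the minority class as the full dataset.

```python
# Sketch: StratifiedKFold keeps the class ratio in every fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced synthetic data: about 90% class 0, 10% class 1
X_imb, y_imb = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
ratios = []
for i, (train_idx, test_idx) in enumerate(skf.split(X_imb, y_imb), 1):
    ratios.append(y_imb[test_idx].mean())
    print(f"Fold {i}: share of class 1 in the test fold = {ratios[-1]:.2f}")
```

With plain KFold, a random fold could over- or under-sample the minority class; stratification removes that source of variance.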

k-fold
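For time-ordered data, where stratified shuffling is inappropriate, scikit-learn offers TimeSeriesSplit; a tiny sketch on 12 artificial time-ordered observations shows that each training window strictly precedes its test window, so the model never "sees the future".

```python
# Sketch: TimeSeriesSplit produces chronologically ordered train/test windows.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_ts = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_ts))
for i, (train_idx, test_idx) in enumerate(splits, 1):
    print(f"Split {i}: train={list(train_idx)}, test={list(test_idx)}")
```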

Performance Comparison

In [17]:
import matplotlib.pyplot as plt

kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = [
    ('Logistic Regression', LogisticRegression()),
    ('Gradient Boosting', GradientBoostingClassifier()),
    ('Random Forest', RandomForestClassifier())
]

kfold_scores = []
stratified_kfold_scores = []

for name, model in models:
    kfold_scores.append(cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy').mean())
    stratified_kfold_scores.append(cross_val_score(model, X_train, y_train, cv=skf, scoring='accuracy').mean())
In [20]:
fig, ax = plt.subplots(figsize=(10, 6))
bar_width = 0.4
bar_positions_kfold = np.arange(len(models))
bar_positions_stratified_kfold = bar_positions_kfold + bar_width
ax.bar(bar_positions_kfold, kfold_scores, bar_width, label='K-Fold')
ax.bar(bar_positions_stratified_kfold, stratified_kfold_scores, bar_width, label='Stratified K-Fold')
ax.set_xticks(bar_positions_kfold + bar_width / 2)
ax.set_xticklabels([model[0] for model in models])
ax.set_xlabel('Models')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Mean Accuracy of Models under Different Cross-Validation Techniques')
ax.legend()

def autolabel(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.2%}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),  
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(ax.patches)

plt.show()
Q. Comment on the results: why are they so similar? As noted earlier, the dataset is balanced, so the class-preserving property of Stratified K-Fold provides little extra benefit here.

W3sicXVlc3Rpb24iOiAiV2hpY2ggb2YgdGhlIGZvbGxvd2luZyBzdGF0ZW1lbnRzIGFib3V0IFN0cmF0aWZpZWQgSy1Gb2xkIENyb3NzLVZhbGlkYXRpb24gaXMgdHJ1ZT8iLCAidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJzIjogW3siYW5zd2VyIjogIkl0IHNodWZmbGVzIHRoZSBkYXRhIGJlZm9yZSBzcGxpdHRpbmcgaW50byBmb2xkcyIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJTaHVmZmxpbmcgY2FuIGJlIGVtcGxveWVkIGluIHZhcmlvdXMgY3Jvc3MtdmFsaWRhdGlvbiBtZXRob2RzLCBidXQgaXQncyBub3QgaW5oZXJlbnQgdG8gdGhlIGRlZmluaXRpb24gb2YgU3RyYXRpZmllZCBLLUZvbGQuIn0sIHsiYW5zd2VyIjogIkl0IHByZXNlcnZlcyB0aGUgcGVyY2VudGFnZSBvZiBzYW1wbGVzIGZvciBlYWNoIGNsYXNzIGluIGV2ZXJ5IGZvbGQiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJTdHJhdGlmaWVkIEstRm9sZCBDcm9zcy1WYWxpZGF0aW9uIGVuc3VyZXMgdGhhdCBlYWNoIGZvbGQgb2YgdGhlIGNyb3NzLXZhbGlkYXRpb24gcmV0YWlucyB0aGUgc2FtZSBjbGFzcyBkaXN0cmlidXRpb24gYXMgdGhlIG9yaWdpbmFsIGRhdGFzZXQuICJ9LCB7ImFuc3dlciI6ICJJdCBvbmx5IHdvcmtzIHdpdGggcmVncmVzc2lvbiBwcm9ibGVtcyIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJTdHJhdGlmaWVkIEstRm9sZCBDcm9zcy1WYWxpZGF0aW9uIGlzIGNvbW1vbmx5IHVzZWQgaW4gY2xhc3NpZmljYXRpb24gdGFza3MuIn0sIHsiYW5zd2VyIjogIkl0IHJhbmRvbWx5IHNlbGVjdHMgYSBzaW5nbGUgc2FtcGxlIGFzIHRoZSB2YWxpZGF0aW9uIHNldCIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJTdHJhdGlmaWVkIEstRm9sZCBDcm9zcy1WYWxpZGF0aW9uIGRvZXMgbm90IHJhbmRvbWx5IHNlbGVjdCBhIHNpbmdsZSBzYW1wbGUgZm9yIHZhbGlkYXRpb24uIn1dfV0=

In [23]:
display_quiz("#qqq5")

Leave-One-Out Cross-Validation

This approach follows the same idea as K-Fold; in fact, LOOCV is identical to K-Fold with $K = N$ (the dataset size). The intuitive difference between the two families is worth noting: in K-Fold we specify the number of groups, and the size of each group is determined by the data size, whereas in the leave-out methods (Leave-One-Out, Leave-P-Out) we specify the size of the validation set itself, and the number of groups is determined by the data size.

Leave-One-Out Cross Validation
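A small sketch of LOOCV, assuming a tiny synthetic dataset so that the $N$ model fits stay cheap: cross_val_score with LeaveOneOut fits the model once per sample, and each fold's score is either 0 or 1.

```python
# Sketch: LOOCV is K-Fold with K equal to the number of samples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X_small, y_small = make_classification(n_samples=30, random_state=0)

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_small, y_small, cv=loo)
print(f"{len(scores)} fits, mean accuracy = {scores.mean():.2f}")
```

This also makes the computational cost visible: 30 samples already require 30 separate fits, which is why LOOCV is rarely used on large datasets.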

W3sicXVlc3Rpb24iOiAiV2hhdCBpcyB0aGUgbWFpbiBkaXNhZHZhbnRhZ2Ugb2YgTGVhdmUtT25lLU91dCBDcm9zcy1WYWxpZGF0aW9uIGNvbXBhcmVkIHRvIG90aGVyIGNyb3NzLXZhbGlkYXRpb24gbWV0aG9kcz8iLCAidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJzIjogW3siYW5zd2VyIjogIkl0IHJlcXVpcmVzIG1vcmUgY29tcHV0YXRpb25hbCByZXNvdXJjZXMiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJSaWdodCEgTE9PQ1YgaW52b2x2ZXMgY3JlYXRpbmcgYXMgbWFueSBmb2xkcyBhcyB0aGVyZSBhcmUgc2FtcGxlcyBpbiB0aGUgZGF0YXNldCwgcmVzdWx0aW5nIGluIGEgbGFyZ2UgbnVtYmVyIG9mIG1vZGVsIGZpdHMuICJ9LCB7ImFuc3dlciI6ICJJdCB0ZW5kcyB0byBvdmVyZml0IHRoZSBtb2RlbCIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJMT09DViBkb2VzIG5vdCBpbmhlcmVudGx5IGxlYWQgdG8gb3ZlcmZpdHRpbmcuIn0sIHsiYW5zd2VyIjogIkl0IGlzIGJpYXNlZCB0b3dhcmRzIHNtYWxsZXIgZGF0YXNldHMiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTE9PQ1YgaXNuJ3QgcGFydGljdWxhcmx5IGJpYXNlZCB0b3dhcmRzIHNtYWxsZXIgZGF0YXNldHMuICJ9LCB7ImFuc3dlciI6ICJJdCBjYW4gYmUgY29tcHV0YXRpb25hbGx5IHNsb3cgZm9yIGxhcmdlIGRhdGFzZXRzIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIml0J3MgdHJ1ZSB0aGF0IExPT0NWIGNhbiBiZSBjb21wdXRhdGlvbmFsbHkgc2xvdywgZXNwZWNpYWxseSBmb3IgbGFyZ2VyIGRhdGFzZXRzLCB0aGlzIGlzbid0IGl0cyBwcmltYXJ5IGRpc2FkdmFudGFnZS4ifV19XQ==

In [21]:
display_quiz("#qqq4")


Learning Curves

In [36]:
import numpy as np
import plotly.graph_objs as go
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

models = {
    'Logistic Regression': LogisticRegression(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Random Forest': RandomForestClassifier()
}

def plot_learning_curves(models, X, y, cv=None, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    colors = ['blue', 'green', 'red'] 
    
    data = []
    color_index = 0  
    for name, model in models.items():
        train_sizes, train_scores, test_scores = learning_curve(
            model, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        
        
        trace1 = go.Scatter(
            x=train_sizes, y=train_scores_mean,
            mode='lines+markers',
            name=f"{name} (Training score)",
            line=dict(color=colors[color_index])  
        )
        trace2 = go.Scatter(
            x=train_sizes, y=test_scores_mean,
            mode='lines+markers',
            name=f"{name} (Cross-validation score)",
            line=dict(color=colors[color_index])  
        )
        color_index += 1  
        
        
        trace3 = go.Scatter(
            x=np.concatenate([train_sizes, train_sizes[::-1]]),
            y=np.concatenate([train_scores_mean - train_scores_std,
                              (train_scores_mean + train_scores_std)[::-1]]),
            fill='toself',
            fillcolor='rgba(0,100,80,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        )
        trace4 = go.Scatter(
            x=np.concatenate([train_sizes, train_sizes[::-1]]),
            y=np.concatenate([test_scores_mean - test_scores_std,
                              (test_scores_mean + test_scores_std)[::-1]]),
            fill='toself',
            fillcolor='rgba(255,140,0,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        )
        
        data.extend([trace1, trace2, trace3, trace4])
    
    layout = go.Layout(
        title='Learning Curves for Different Models',
        xaxis=dict(title='Training examples'),
        yaxis=dict(title='Score'),
        legend=dict(x=0.7, y=1.1)
    )

    fig = go.Figure(data=data, layout=layout)
    fig.show(renderer='notebook')
    


plot_learning_curves(models, X_train, y_train, cv=5)

In this graph you can see how the performance of each model changes with the amount of training data, and how the training and cross-validation scores converge as more data becomes available.

General note comparing the above approaches

Approach           Execution speed   Efficient (small data)   Efficient (large data)
Train-test split         ✔                    ✕                       ✔
K-Fold                   ✕                    ✔                       ✔
LOOCV                    ✕                    ✔                       ✕